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Abstract 

We propose a novel statistical model to answer three challenges in direct marketing: which channel 
to use, which offer to make, and when to offer. There are several potential applications for the pro¬ 
posed model; for example, developing personalized marketing strategies and monitoring members’ needs. 
Furthermore, the results from the model can complement and can be integrated with other existing 
models. 

The proposed model, named Dynamic Propensity Model, is a latent variable time series model that 
utilizes both marketing and purchase histories of a customer. The latent variable in the model represents 
the customer’s propensity to buy a product. The propensity derives from purchases and other observable 
responses. Marketing touches increase a member’s propensity, and propensity score attenuates and 
propagates over time as governed by data-driven parameters. To estimate the parameters of the model, 
a new statistical methodology has been developed. This methodology makes use of particle methods 
with a stochastic gradient descent approach, resulting in fast estimation of the model coefficients even 
from big datasets. The model is validated using six months’ marketing records from one of the largest 
insurance companies in the U.S. Experimental results indicate that the effects of marketing touches vary 
depending on both channels and products. We compare the predictive performance of the proposed 
model with lagged variable logistic regression. Limitations and extensions of the proposed algorithm are 
also discussed. 


1 Introduction 

Direct marketing is a form of data-driven advertising that markets straight to potential consumers through 
various marketing channels. Direct Marketing Association (2011) reported that in 2010, business and non¬ 
profit organizations spent $153 billion on direct marketing, which is approximately 54% of all advertisement 
expenditures in the United States. The scope of direct marketing channels has expanded from traditional 
direct mails to targeted online display ads, search, and social media sites, and the sizes of consumer databases 
have significantly increased. In recent years, web-service companies, such as Google and Yahoo, have applied 
large-scale data mining and machine learning techniques to online ad optimization e.g. Perlich et al. (2013) 
detail one such system, Yan et al. (2009) empirically showed that behavioral targeting improves click-through 
rates, Gupta et al. (2012) applied matrix factorization for display advertisement matching, Khanna et al. 
(2012) developed specialized methods for large scale modeling for targetted ads on hadoop, and in a recent 
KDD Workshop on Data Mining for Advertising, Shen et al. (2008) covered technological advances in on¬ 
line direct marketing. The landscape of direct marketing is dynamically changing and expanding to new 
applications. 

Two statistical models are particularly popular for direct marketing: uplift modeling and RFM (Recency, 
Frequency, and Monetary value analysis). Uplift modeling, also known as incremental modeling or net-lift 
modeling, aims to measure the effectiveness of advertisement through randomized control and test groups 
(Radcliffe and Surry, 1999; Lo, 2002). Although uplift modeling can provide strong return on investment 
cases for some applications, a large number of control and test samples are needed to estimate extremely 
low response rates e.g. insurance purchases. While uplift modeling predicts the difference that a marketing 
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touch makes, RFM is a marketing-agnostic summary statistics for describing customers (Fader et al., 2005). 
In the RFM modeling, a customer is represented using three variables; the time of the most recent purchase, 
the frequency of purchases, and average spending per purchase. McCarty and Hastak (2007) compared the 
predictive performances of RFM, CHAID, and logistic regression, and concluded that RFM was comparable 
to the other two methods. However, since RFM is marketing-agnostic, high-lift customers, whose purchase 
behaviors are significantly affected by marketing touches, tend to be neglected when RFM is used as the 
primary targeting method. 

State space models, although not popular in direct marketing literature, can potentially address three 
main challenges in direct marketing: which channel to use, which offer to make, and when to offer. A state 
space model, also known as dynamic linear model, is a latent variable time series model that represents 
a physical system as a set of input, output, and (latent) state variables. In direct marketing, the input 
and output refer to the marketing touch and the response respectively, and the state can be interpreted 
as propensity to purchase a product. Although state space models have been widely adopted in diverse 
academic domains such as machine learning, economics, and finance (see Section 2.2), there have been few 
applications in direct marketing to the best of our knowledge. Dekimpe and Hanssens (2000) provide two 
reasons for the poor adoption of time series modeling in marketing: lack of adequate data sources and lack of 
easy-to-use times series softwares for marketers. Pauwels et al. (2004) summarized potential challenges when 
time series models are applied in marketing. Recently, Dekimpe and Hanssens (2010) noted that time series 
modeling has just started to gain popularity in the marketing science community due to recent developments 
in marketing databases and computational power: Fader et al. (2010) used negative beta-geometric and beta- 
Bernoulli distributions with two latent variables (dead or alive) to model recursive discrete-time donation 
behavior, and recently, Lee et al. (2014) adopted a state space model to measure brand inertia. 

In this paper, we present a new state space model for direct marketing, namely Dynamic Propensity 
Model (DPM). The proposed model is a latent variable time series model that utilizes both marketing and 
purchase histories of a customer where the latent variable (state) represents a member’s propensity to buy a 
product. Our model tracks an individual’s marketing touches and responses, then estimates his personalized 
probabilities for the first purchase at daily resolution. Marketing histories contain the records of when and 
how many emails, direct mails, phone calls, display ads, and referral are sent or clicked. In the model, such 
marketing touches increase an individual’s propensity, and this propensity score attenuates and propagates 
over time. We use six months’ marketing records from one of the largest insurance companies in the U.S. 
The degrees of the marketing effects and propagation strengths are statistically estimated from the data. To 
estimate the parameters of the model, a new statistical methodology has been developed, which combines 
particle methods and a stochastic gradient descent approach. 

The design of DPM was, in fact, motivated by several findings. Initially, our research goal was to find the 
retrospective correlation between marketing touches and purchase activities using generalized linear models. 
During the investigation, however, we have observed that some of the marketing touches have negative effects 
on purchase behaviors (see Section 5). For some marketing practices, it is possible that some customers may 
be annoyed by frequent marketing touches. Moreover, there are several aspects of the data that couldn’t be 
addressed by a simple generalized model: 

• Direct mails, emails, promotion phone calls are sent out to a strategically filtered group of customers. 
The company targeted persuadable customers, whose actuarial purchase probabilities are expected to 
be positively impacted by marketing touches. 

• Control and test groups are not readily identifiable for semi-targetable marketing events such as display 
ads and referrals. Uplift modeling cannot directly measure the effects of such semi-targetable marketing 
touches. 

• Past marketing touches have some effect on purchase activities. Appropriate effect time windows need 
to be determined to accurately estimate the true effects of marketing touches. 

To address these findings, we needed a model that can track individual customers with different marketing 
histories. The model also needed to provide consistent interpretations for building marketing strategies and 
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Table 1: Notation. 


Symbol Explanation 
yl purchase indicator (output) 

y\ integer vector of non-actionable marketing touches (input) 

m) integer vector of actionable marketing touches (input) 

x\ filtered propensity (latent state) 

s\ predictive propensity (auxiliary latent state) 

£\ standard normal noise i.e. £\ ~ N(0,1) 

c offset parameter for propensity 

(j) damping factor for propensity 

ot coefficient vector for semi-targetable marketing touches e.g. display ads 

/3 coefficient vector for targetable marketing touches e.g. direct mail 

6 parameter vector i.e. 6 = (c, </>, ot,f3) 

Pr probability measure 

C loss function or objective function 


monitoring customers’ needs. We view a purchase as an instantiation of marketing satisfaction (Westbrook, 
1987), rather than a cognitive decision making process involving the semantic meaning of product attributes 
(Oliver, 1980). A dynamic process of satisfaction (Fournier and Mick, 1999) naturally leads to algebraic 
linkages between temporally close satisfaction variables. These findings and objectives led to the construction 
of our Dynamic Propensity Model. 

Table 1 shows the mathematical notation used in this paper. The rest of this paper is organized as 
follows: In Section 2, we cover the basics of particle methods, parameter estimation techniques in state- 
space models, and stochastic gradient descent. In Section 3, we formally introduce Dynamic Propensity 
model, and investigate the properties of the model. The parameter estimation method for the proposed 
model is provided In Section 4, and empirical results from the real-life dataset are illustrated in Section 5. 
Finally, we discuss the limitations of the proposed method and future work in Section 6. 


2 Background 

In this section, we cover related work on particle methods, parameter estimation techniques in state-space 
models, and basics of stochastic gradient descent. These techniques form the building blocks of our proposed 
model, DPM, which is a latent variable time series model for large-scale enterprise size data. We found 
that traditional parameter estimation approaches for SSMs become intractable when applied to the size of 
our dataset. As a result, we developed a new parameter estimation technique by adopting (1) Monte Carlo 
simulation methods, and (2) stochastic optimization algorithms for the DPM objective function. 

2.1 Particle Methods 

If both the observation and the latent variables are normally distributed, the optimal filtering is known 
to be the Kalman Filter (KF) (Kalman, 1960). For non-linear systems, several approximation techniques 
based on linearization, such as Extended KF (first-order approximation) and Unscented KF (second-order 
approximation), can be applied. However, such linearization usually causes non-diminishing bias, and even 
worse, the algorithms are typically difficult to implement and tune correctly (Julier and Uhlmann, 2004). 
Particle methods (Gordon et ah, 1993) use a different kind of approximation technique: Monte Carlo sim¬ 
ulation. Unlike those variants of KF, the state estimates from particle methods can be made arbitrarily 
accurate with enough particles. Particle methods are based on a sequence of importance sampling steps. 
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Resampling techniques (Liu and Chen, 1998) are typically adopted to decelerate the degeneracy of particles. 
Also, Auxiliary Particle Filter has been developed to prevent the degeneracy of the Sequential Monte Carlo 
(Pitt and Shephard, 1999). Particle methods are powerful general state-space estimation techniques that are 
widely applicable to non-linear evolution and observation processes (Doucet and Johansen, 2008). In this 
paper, we use a particle filter with resampling to estimate a sequence of dynamic propensity scores. 

2.2 Latent Variable Time Series Models 

Latent variable time series models are categorized into two classes of models: state-space models (continuous 
latent variable) and hidden Markov models (discrete latent variable). Latent variables are effective for sum¬ 
marizing past observations, capturing an underlying dynamics, and providing human-interpretable results 
from complex observations (Ho et ah, 2013). Raghavan et al. (2012) developed a hidden Markov model for 
modeling activity profiles of terrorist activities. Xing et al. (2010) developed a state space model to capture 
time altering networks. Valentini et al. (2013) employed a spatially structured factor analysis to model house 
prices, while Nagaraja et al. (2011) used an autoregressive technique to capture the time effect. Aktekin 
et al. (2013) used a Bayesian state space model to estimate mortgage default risk. 

In this paper, we use a state space model to capture the (unobserved) propensity^. Parameter estimation 
techniques for SSMs fall into three main groups: Bayesian online, maximum-likelihood offline, and maximum- 
likelihood online settings (Kantas et ah, 2009). In the Bayesian online setting, model parameters are assumed 
to be dynamic over time series, and they are sequentially estimated. Some of the successful algorithms are 
Liu-West Filter (Liu and West, 2001), Storvik Filter (Storvik, 2002), and Particle Learning (Carvalho et ah, 
2010). For the maximum-likelihood online setting, Andrieu et al. (2005) have demonstrated an online 
estimation algorithm using block time series and pseudo-likelihood. Recall that the parameters in this paper 
are fixed but unknown. 

In the offline (or batch) maximum-likelihood setting, two approaches have been popular: Fisher’s scoring 
and Expectation-Maximization (EM). The Fisher’s scoring algorithm is a variant of Newton-Raphson algo¬ 
rithm based on the log-likelihood function. However, obtaining the log-likelihood of a time series is typically 
intractable. Doucet and Tadic (2003) proposed a general approach for approximating the log-likelihood using 
particle methods. Although this Fisher’s scoring algorithm is generally applicable to several settings, it is 
difficult to scale the gradients for high dimensional parameters (Kantas et ah, 2009). The EM algorithm is 
numerically more stable and usually computationally cheaper for high dimensional parameters. For a Gaus¬ 
sian SSM, the EM algorithm can be implemented using Kalman Filter and Smoother (Shumway and Stoffer, 
1982). For non-linear systems, Zia et al. (2008) introduced the EM-PF (EM using Particle Filter) algorithm, 
but many of the assumptions are not applicable in our setting. For categorical time series. Park et al. (2014) 
proposed an efficient parameter estimation algorithm based on a stochastic EM algorithm (Nielsen, 2000). 
However, their problem setting is slightly different from our setting; they assume different latent dynamics 
for individuals, and reasonably balanced class labels. In this paper, we use an alternating maximization 
approach that is similar to the EM algorithm. There are some differences between our approach and the 
EM algorithm, which will be explained in Section 4. 

2.3 Stochastic Gradient Descent 

Stochastic Gradient Descent (SGD) is a large-scale optimization technique for a summation of differentiable 
objective functions. Consider an objective function C{6) based on a parameter 6 where the objective function 
can be written as C{6) = Ci{0). An iterative method like gradient descent can be used to reach a local 

optimum in expectation. If 0*-“^ is the estimate for the parameter in the uth iteration, in the next iteration, 
a gradient descent algorithm produces: 

q(v+i) ^ giv) _ ^(Dv/:(0(’')) = ^ VA(6>(’’)) 


^Discrete latent variables in hidden Markov models cannot model continuous propensity scores. 
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where 7 ^’'^ is a suitable step size. The expectation E{C{9)) is to be calculated over the entire dataset for 
every update of 6. This can be costly for larger datasets. 

A stochastic version of the gradient descent, SGD, is a more practical approach for large-scale learning 
problems (Bottou and Bousquet, 2007). SGD guarantees convergence to the optimal 6 while doing away 
with the costly expectation calculation (Bottou, 1998). The update step for SGD is simpler: 

q{v+i) ^ 

iei 

where Z is a subset of the dataset. The subset can be as small as a single data point, though usually using 
small batches decreases the variance and leads to quicker convergence. In this paper, we use SGD in our 
alternating maximization approach. 


3 Dynamic Propensity Model 

Dynamic Propensity Model (DPM) is a latent variable time series model that utilizes both marketing and 
purchase histories of a customer. The latent variable of DPM represents a member’s propensity to buy a 
product. The model tracks an individual’ marketing touches and responses, and estimates his personalized 
probabilities for the first purchase at daily resolution. These probability scores can be used in multiple 
applications: ( 1 ) predicting when the customer is likely to buy the product (within-customer application) 
and ( 2 ) targeting likely-to-buy customers (across-customer application). 

DPM uses a different modeling strategy from traditional targeting models. Traditional targeting mod¬ 
els estimate cross-sectional probabilities, whereas DPM models longitudinal probabilities. In other words, 
traditional targeting models answer whom to offer, and DPM suggests when to offer. Although these two 
goals may seem fundamentally different, one indirectly indicates the other given a limited marketing budget. 
This connection is shown in Figure 1. In cross-sectional modeling, choosing a subset of customers is typically 
based on the rank orders of probability scores at a given time. For a different cross-section of data, a different 
subset of customers will be chosen. As a result, a customer may or may not be chosen for direct marketing 
at a given time, and this property indirectly determines when to offer. 

Although their outcomes may be superficially similar, each approach has different modeling limitations. 
Gross-sectional models are easy to estimate, but complex effects, such as time-varying effects and customer 
heterogeneity, are difficult to model. On the other hand, longitudinal models can capture dynamic and 
heterogeneous effects, but they typically need more samples for parameter estimation, and sometimes, some 
model parameters are almost intractable to estimate. Due to the complexity of time-series models, cross- 
sectional models are traditionally modified to handle complex effects, instead of directly using time-series 
models. Allenby and Lenk (1994) applied temporally correlated error terms for modeling household pur¬ 
chase behavior. Keane (1997) suggested a variant of discrete choice model that captures both customer 
heterogeneity and temporal dependency. 

Our goal is to build a time series model that can easily capture customer heterogeneity and temporal 
dependency, and furthermore, the model should be simple enough to extend with other covariates. In this 
paper, we demonstrate DPM on one product and associated marketing touches for promoting that product, 
although it can be extended to deal with multiple products through a matrix representation. In other words, 
marketing touches for promoting different products are ignored to keep the simplicity of our illustration. 
Recall that, in DPM, a customer has a latent factor (propensity) that evolves over time. We assume that 
marketing effects can be super-positioned i.e. they additively affect the propensity. 

A direct application of state-space model can capture these listed properties: 

(Filtered Propensity) x\j^i = Cd + 4>dxl + ccJ r\ + m( -|- e\ 

(Tomorrow’s Purchase) Pr( 7 /f,, = 1) =- ^^^ 

1 -bexp(-a;)+i) 

where the superscript i represents the ith customer, and the subscript t indicates variables at time t. The 
subscript d on the model parameters indicates a demographic segment that exhibit similar responses to 
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Figure 1: The connection between whom to offer and when to offer. 


marketing touches i.e. homogeneous marketing response group. Two different types of marketing touches 
are illustrated: semi-targetable rj and targetable ml marketing touches. The targetable marketing touches 
include emails, direct mails, and phone calls. In these cases, marketing can be directly targeted to a particular 
customers. The semi-targetable marketing touches represent display ads and referrals. Although a company 
can control the exposure to such semi-targetable marketing touches, such marketing touches cannot be 
targeted to a specific individual and the exposure involves a certain degree of randomness. Some semi- 
targetable marketing variables contain marketing responses e.g. clicking display ads, and we view that these 
activities increase the propensity. 

This direct application of state space model, however, faces two practical issues as follows: 

• Time difference within a day: Even though and yt are indexed by the same subscript, one 
occurs before the other e.g. in the morning and yt in the afternoon. In this paper, we assume that 
a purchase decision yt is made at the very start of a day, thus m* and affect the next day’s purchase 
decision yt+i- 

• No data anchor on marketing effects: If x is known, then estimating a and /3 only depends 
on X in the naive SSM application. If the estimates of x is noisy, then the error would propagate 
to the parameter estimates. In fact, we have observed that parameters do not converge when using 
this model. The estimated parameters should be anchored on actual data samples, rather than being 
determined by latent variables. 

These issues can be addressed by introducing an auxiliary variable Sj+i between Xt and Xt+i- 

We now introduce the model equations for DPM: 

(Propagated Propensity) xl = si + el 
(Predicted Propensity) s5_|_i = Cd + (jjdxl -f aj rj -|- /3jm) 

(Tomorrow’s Purchase) Pr(?/!, , = 1) = \ . 

1-f exp(-s)+i) 

where e) is drawn from the standard normal distribution. The auxiliary variable Sj+i is deterministic 
given today’s propensity Xt and marketing touches m^ and r*. This auxiliary variable can be viewed as 
the propensity in the evening that directly affects the purchase decision at the very start of the next day. 
Figure 2 shows and compares the graphical representations of the naive SSM application and DPM. 

The auxiliary variable in DPM not only refines the time resolution of purchase process, but also provides 
a new perspective on parameter estimation. As an illustrative example, let us consider an alternating maxi- 
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mization approach for parameter estimation. In the SSM formulation, we basically solve two maximization 
problems: 

max 0 I X, y max 0 \ x 

max X I 0,y max x | 0,y 

where 6 = {c, (j), a, /3}. Note that y in the first maximization problem is removed, since 6 is solely determined 
by X i.e. data-anchor issue. On the other hand, in DPM, we solve three maximization problems with respect 
to 6, X, and s: 

max 9 I X, y,s max 0 | x,y 

max X I 0,y, s max x | 0,y 

max s I 0,x,y (s | 0,x,y) 

where the last maximization problem is trivial as s is determined by x and 9. In fact, these three maximiza¬ 
tion problems are actually two maximization problems. To remove the last maximization problem, we treat 
the auxiliary variable as a nuisance variable. We remove s using the relation sj+i = Cd + ipdXt + otJ'Tt+l^dmt- 
Both the resultant two maximization problems are now anchored by the observation y. The path from xt 
to yt+i in DPM involves the model parameter 9 (see Figure 2). The auxiliary variable in DPM resulted in 
a different set of maximization problems from the one from the SSM formulation. 

The complete joint probability distribution of DPM for a particular customer can be recursively factorized 
as follows: 


Pr(R,M,y,x) 

= Pr(R,M) ]^Pr(yt+i | rt, mt) Pr(a:t | 

marketing strategy ^ customer heterogeneity temporal dependency 

where we removed the superscript i to make notation simple. As can be seen, the model captures both 
customer heterogeneity and temporal dependency. Note that, in DPM, customer heterogeneity originates 
from differences in marketing touches and responses over time. 

In fact, customer heterogeneity can be captured by demographic segments such as age, income, and family 
size. This demographic heterogeneity can affect customers’ base purchase rates and response dynamics. Note 
that, in our DPM formulation, the model parameters are indexed by the demographic segment variable d. In 
practice, we build a separate model for each demographic segment that exhibits homogeneous behavioural 
patterns. For clarity, we focus on one demographic segment from here onwards, dropping the index d. 


4 Parameter Estimation 


The maximum likelihood parameter for DPM is hard to optimize, as can be seen from the likelihood equation: 

N T* 

arg max nnP’'e(2/t+i I 

i=l t=l 


where N is the total number of customers, and T* is the number of days that the tth customer is monitored. 
The likelihood is defined on the observed variables: yl, m(, and r(. To obtain the actual value of the 
likelihood, we need to integrate out the latent states: 


N „ T’ 

argmaxJI / YlPreiyl+i \ xl,rl,ml)Pig{xl \ m(_i)(ix 

i=l t=l 


N „ T‘ 

= argmax^log / PPre(yJ+i I a:?, rj, m()Pre(a;( | m(_i)dx 
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(b) Dynamic Propensity Model 


Figure 2: Graphical Models for the simple SSM and DPM. The path from xt to yt+i in DPM involves the 
model parameter 0, while the SSM path from xt to yt+i is blocked by Xt+i- 


where x represents The inner integral makes optimization of the log-likelihood intractable. This type 

of maximization problem has been traditionally approached by the EM algorithm (Neal and Hinton, 1998). 
The EM algorithm derives a lower bound of the log-likelihood, and then maximizes this lower bound. The 
lower bound is obtained using Jensen’s inequality for any arbitrary distribution Q*(x) : 


N 

E 

2=1 ’ 


nv I xl,rl,ml)Pre{xl \ xl_^rl_^,ml_^) 

: ® Q*(x) 


( 1 ) 


This lower bound is maximized when 


Q*(x) ^YIpMvI+i I xl,rl,ml)PT0ixl \ m)_i) 


i=l 


For dynamic linear systems with Gaussian noise, Q*(x;) can be obtained in a closed form. However, the 
binary purchase indicators cannot be directly modeled as numeric variables, and the form of Q*(x) should 
be numerically approximated. The EM algorithm for DPM can be written as follows: 

1. Initialize 0 

2. Until convergence, 

(a) (E-step) Estimate Q* that maximizes Equation 1 
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(b) (M-step) Estimate 0 that maximizes Equation 1 

This method is practical only for a small N, but not for the size of our data. If we use particle filters for 
the E-step, the number of simulation particles to store is P x illustrative example, if we use 

5000 particles for 200K customers with average 100 observations per customer, a hundred million particles 
need to be stored and flushed at every iteration. Furthermore, the M-step needs to run on the massive size 
of particles, which is also a challenging engineering problem. 

To facilitate practical optimization, we optimize a surrogate instead of the true likelihood function by 
viewing xt as a temporally correlated random offset parameter, rather than a latent variable. This view 
helps us maximize the conditional likelihood function of DPM, instead of the marginal likelihood. Our goal 
is to obtain estimates for both X = {x^,x^,... ,x'^} and 9. The original optimization problem with the 
parameter set 9 now transforms to a new optimization problem with two sets of variables as follows: 

arg max£DPM(0, X) 

0 ,x 

N T" 

= argmax J|J|Pre(2/J+i I xl,rl,Ynl)PTg{xl \ rj_i, mj_i) 

( 2 ) 

N T* 

= argmax^^logPre( 2 /J+i | xl,rl,ml) + \ogPrg{xl \ r*_i, m*_i) 

a.X. ' ^ ^ ^ ^ ^ ^ 

i—1 t=^l ^ V- 

generalized linear model time series 

where Pdpm is the objective function of DPM. As can be seen, the inner integral is removed since X 
is treated as a parameter matrix. The factorized form in Equation 2 provides insightful explanation about 
the objective function. The hrst part of the factorized form is a logistic regression model with customer 
heterogeneity, while the second part explains temporal dependency. This objective function can be also 
viewed as a logistic regression with sophisticated temporal regularization, or a time series model with a 
logistic loss regularization. 

An alternating maximization of the objective function can be written as follows: 

1. Initialize 9 

2. Until convergence on 0, 

(a) Maximize Udpm over X using Particle Methods 

(b) Maximize £dpm over 9 

This procedure is more efficient than the brute-force EM algorithm, since only the most likely single path 
xi:T needs to be stored. However, the maximization over 9 can be still expensive because of large number 
of customer samples. Stochastic Gradient Descent (SGD) is a suitable solution to this type of large-scale 
learning problem. If we apply SGD on the second maximization step, we obtain: 

1. Initialize 9 

2. Until convergence on 9, 

(a) Maximize Udpm over X 

(b) Initialize 0 sgd = 9 

(c) Until convergence on 0 sgd, 

i. Select a subset of customer samples, 

A. Maximize £dpm over 0 sgd 


9 



where ^sgd represents the inner loop parameter in the SGD step. This procedure is the same as the previous 
procedure except the SGD maximization part. To estimate 9, a subset of customers are uniformly sampled 
from the entire training dataset. The gradient of the objective function is calculated based on the subset, then 
the inner loop parameter is incremented by the gradient. This inner loop continues until 0 sgd converges. 

The maximization step of X still requires a particle method that scans the entire training dataset. Note 
that not all x* samples are used in the SGD maximization step. The particle estimates on x* can be generated 
on demand according to the selected subset as follows: 

1. Initialize 9 

2. Until convergence on 0, 

(a) Initialize 0sgd = 9 

(b) Until convergence on 0 sgd, 

i. Select a subset of samples, 

A. Maximize £dpm over X using 9 

B. Maximize >Cdpm over 0sgd 

Although X is estimated only on the selected subset of samples, this procedure has nested loops. The 
inner loop estimates the stochastic gradient parameter, and the outer loop resets the stochastic gradient 
parameter. We observe that the nested loops can be effectively removed, since in practice the inner loop 
converges within one or two iterations for customer samples of size I. 

We now introduce the parameter estimation algorithm for DPM in Algorithm 1. 


Algorithm 1: SGDPM: DPM parameter estimation algorithm using stochastic gradient 

Data: Y 
Result: 9 

Initialize 9; 

while Until convergence on 9 do 

Randomly pick a customer y* = [yi,y 2 , ■ ■ ■, j/r*]; 

9 = 9+ ^^eJ0DPM{9,a.rgmax^^ £dpm( 0 ,x*)); 

end 


The SGDPM algorithm converges almost surely if and Under sufficient 

regularity conditions, the best convergence speed is achieved when ~ (Murata, 1998). The algorithm 
converges to local optima, thus for potentially better solutions, one needs to try out several different initial¬ 
izations. The inner argmax is estimated using particle methods, and the outer gradient can be obtained in 
a closed form. As an illustrative example, we show a gradient form for the damping parameter </>: 


T* 


, ^logPre(y)+i | x), mj) -f logPre(a:) | 




1 


+ exp(sj+i) l-texp(sj+i) 


vt+i _ 1 jQg 2ty(j 


^ _l_ xl exp(c -I- (f)xl + a^rl + P^ml) 

t I ^ exp{c + 4>xl + rl + f3^ml) 

xl_i{xl - c-(jixl^i - a^rl_-^ - (3^ 

' O 




{xl - sl)^ 

2cr2 


The SGD update for the damping parameter (j) is as follows: 


^{v+D ^ +^U)d^£j,pM{9‘^-\x^) 
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where the gradient is added, since we want to maximize the objective function. SGD updates for the other 
parameters can be similarly obtained, we omit detailed equations for conserving space. 


5 Empirical Study 

In this section, we evaluate DPM using a real dataset from one of the largest insurance companies in the U.S. 
We first give an overview of the dataset and its basic statistics, and show evidence for the time dependency 
of marketing effects. Baseline models are constructed using logistic regression models with lagged variables. 
Finally, we compare the parameters and predictive performances of DPM and the baseline models. 

5.1 Dataset Overview 

The dataset contains six months’ records of marketing touches, responses, and purchases on 13 different 
kinds of financial products, collected over July 2012 - December 2012. There are six different types of 
marketing touches: three semi-targetable rj = rgj, three targetable = {m\f, mlt, ^ 34 ) marketing 

touches. Marketing touch types are masked due to confidentiality reasons. The records are samples from 
the company’s customer database, stratified based on products and customers who have not yet bought 
the respective product in the beginning of our six months time window. The company believes that, if 
personalized marketing touches can engage these new customers, then they tend to turn into loyal customers. 
Among the 13 different product types, for illustrative purpose, DPM is applied to the first purchases of two 
products: Product A and Product B. Customers are partitioned into training (50%) and test datasets (50%). 

Table 2: Data Format 


id 

time 

r.l 

r.2 

r.3 

TO.l 

m.2 

m.3 

y 

1847410 

July 1 

0 

0 

0 

1 

0 

0 

0 

1847410 

July 2 

0 

0 

0 

0 

0 

0 

0 

1847410 

Dec 2 

1 

2 

0 

0 

0 

0 

0 

1847410 

Dec 3 

1 

0 

0 

0 

0 

0 

1 (purchase) 

1352132 

July 1 

0 

0 

0 

0 

0 

0 

0 


Table 3: Dataset Overview 


Product A 
(70K customers) 
Variable Mean Max 

Product B 
(20K customers) 
Mean Max 

ri 

0.0010 

2 

0.0115 

3 

r 2 

0.0044 

3 

0.0057 

2 

rs 

0.0004 

1 

0.0027 

2 

mi 

0.0165 

1 

0.0168 

1 

m2 

0.0354 

1 

0.0229 

1 

m3 

0.0003 

1 

0.0032 

1 

y 

0.0001 

1 

0.0004 

1 
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Figure 3: Evidence of time-lagged marketing effects. Each cell represents a different marketing touch. 

The format of the data is shown in Table 2. Each row is identified by the combination of IDs and time 
stamps. Marketing touches are recorded as counts i.e. r.2 = 2 means two events of the same marketing 
type for a day. Table 3 shows basic statistics of the data. The dataset for Product A contains about 70,000 
customers, and the dataset for Product B has about 20,000 customers. Both the marketing touches and the 
targets are highly sparse. For example, daily purchase rate of Product A is 0.01%, and 0.04% for Product 
B. 

One of the main motivations for designing DPM was the time dependency of marketing effects. The 
positive effects of past marketing touches can be easily illustrated. Figure 3 shows the histogram of the day 
of the last marketing exposure before purchase. The x-axis represents time at a day resolution, where a; = 0 
is the time of purchase. As can be seen, past marketing events are correlated to purchases. Noticeably, the 
temporal effects show exponential decay, which suggest an underlying autoregressive process. 

5.2 Baseline Models 

We build three logistic regression models using lagged variables as follows: 

(glm) : E[yt+i \ vt,mt] = logif^ (c-I-a([rt + 

1 1 

(glm.lagl) : E\yt+i \ = logit“^(c-I-^ ^ 

1=0 1=0 

2 2 

(glm.lag2) : Elyt+i \ {c+ rt-i/Sj 

1=0 1=0 

where O; and /3; represent the effects for the Flagged marketing touches. These models are estimated using 
the glm method in R 2.15.3. Since our datasets have highly imbalanced target ratios, the parameters of these 
models with lagged variables, especially glm.lagl, sometimes do not converge. We do not show the models 
with deep lagged variables {I > 2), since those models are much harder to estimate. 

5.3 Estimated Parameters 

We estimate the parameters of DPM using the SGDPM algorithm. Figure 4 shows the estimated parameters 
over iterations. Recall that each iteration in SGDPM represents a data block of one customer. The customers 
who purchased the products are selected with a higher probability, since otherwise SGDPM may not see 
any positive examples before it converges. As can be seen, the parameters almost converge after visiting 
approximately two thousand customers. The damping parameters for Product A and Product B are 0.36 
and 0.53, respectively. Marketing effects diminish by about half in the next day, but they still positively 
affect purchases. 

The estimated parameters from the baseline models are significantly different from the DPM parameters. 
Figure 5 compares the coefhcients from two different models: DPM and glm. As can be seen in the figure. 
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Figure 4: SGD estimation of the DPM parameters. 


glm outputs negative weights for some of the marketing effects e.g. /3oi < 0 and ^03 < 0 for Product 
A, and /3oi < 0 and /3o2 < 0 for Product B. These negative coefficients mainly appear on the targetable 
marketing touches mt. A viable explanation for this is that the company is already targeting a specific 
group of customers whose purchase rates are lower than normal population. On the other hand, DPM shows 
all positive coefficients for the marketing touches. DPM tracks customers’ marketing histories, and adjusts 
the current offsets Xt for purchases. Thus, the coefficients from DPM are more robust for sampling biases 
than the baseline models. 

Table 4 shows the estimated parameters of the three baseline models for Product B. The negative ef¬ 
fects have higher p-values compared to the positive effects. For example, the p-value of glm.lagl /3i2 is 
0.974. Some of the lagged variables have significant effects on purchases, which supports our claim that past 
marketing touches affect purchase behaviors. These GLM models put more weights on the semi-targetable 
marketing touches than the targetable marketing touches. Unless the effects of the targetable marketing 
touches are estimated using control and test groups, these results do not provide insights for building ac¬ 
tionable marketing strategies. 
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Figure 5: Estimated Parameters from DPM and GLM. 


Table 4: Estimated Parameters for Product B from Generalized Linear Models. 



glm 

glm.lagl 

glm.lag2 

Par am. 

Estimate 

p-value 

Estimate 

p-value 

Estimate 

p-value 

aoi 

2.02031 

< 2e-16 

1.90741 

< 2e-16 

1.84673 

< 2e-16 

002 

2.74625 

< 2e-16 

2.66615 

< 2e-16 

2.59654 

< 2e-16 

003 

3.16096 

< 2e-16 

3.14794 

< 2e-16 

3.18083 

< 2e-16 

boi 

-0.59591 

0.185 

-0.59042 

0.18922 

-0.59176 

0.188291 

Po2 

-0.32632 

0.308 

-0.31584 

0.32377 

-0.33462 

0.296802 

bos 

1.30361 

1.74e-05 

1.25721 

4.05e-05 

1.25432 

4.52e-05 

Oil 

- 

- 

0.66684 

0.00018 

0.51507 

0.007232 

Ol2 

- 

- 

0.60640 

0.05751 

0.55132 

0.086354 

Ol3 

- 

- 

1.63553 

5.69e-05 

1.65591 

4.32e-05 

bn 

- 

- 

-1.25127 

0.21076 

-1.24799 

0.212039 

bl2 

- 

- 

-0.01249 

0.97401 

-0.04239 

0.911988 

bl3 

- 

- 

1.21296 

0.01544 

1.10564 

0.029452 

021 

- 

- 

- 

- 

0.52325 

0.013828 

022 

- 

- 

- 

- 

0.52655 

0.149112 

023 

- 

- 

- 

- 

1.53267 

0.000412 

bsi 

- 

- 

- 

- 

-0.11958 

0.836673 

b22 

- 

- 

- 

- 

0.62343 

0.025465 

b23 

- 

- 

- 

- 

1.71137 

2.53e-05 
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Figure 6: Receiver Operating Characteristic curves from the test datasets. 


5.4 Predictive Performance 

We measure the predictive performances of DPM and the baseline models using hold-out test datasets. The 
customers in the test datasets do not overlap with the customers in the training datasets. The performances 
were measured using Receiver Operating Characteristics (ROC) curves. Figure 6 shows the measured ROC 
curves for both Product A and B. As can be seen, the DPM’s areas under the curves are significantly higher 
than those of the baseline models. 

Although their performances are comparable when False Positive Ratios are close to zero (FPR « 0), the 
baseline models cannot capture the purchase probabilities of the customers who had (latent) intentions of 
purchasing the products. In other words, the logistic regression models are trained directly on the observed 
variables, and there is no latent variable involved. On the other hand, DPM, a latent-variable time series 
model, simultaneously estimates both the intention of purchasing the products and the parameters of the 
model. The latent intentions of purchasing are captured through marketing touches and their corresponding 
responses: the customers who clicked display ads, who received referrals, or who positively responded to 
promotion phone calls. By modelling these latent intentions, DPM achieves better performance curves than 
traditional lagged variable model. 

The visualization of the latent variable provides a different perspective on purchase behaviors. Figure 7 
shows the dynamic purchase probabilities from two models of a customer. A customer’s propensity to 
purchase propagates and accumulates over time. The predicted purchase probability of DPM shows the 
process of building up a purchase decision. On the 28th and 29th day, the customer received and responded 
to the ri marketing touch, and on the 30th day, he was targeted by the m 2 marketing touch. Through the 
process of marketing touches and responses, his propensity to purchase Product B has gradually increased. 
On the other hand, the baseline model (glm) does not capture this cumulative process, and it actually 
provides a lower probability score on the purchase day than the day before. 


6 Concluding Remarks 

In this paper, we proposed a state space model, DPM, to answer three main challenges in direct marketing: 
which channel to use, which offer to make, and when to offer. DPM is a latent variable time series model that 
utilizes demographics, and both marketing and purchase histories of a customer. To estimate the parameters 
of the model, a new statistical methodology, SGDPM, was developed. This methodology combines state- 
space models with a stochastic gradient descent approach, resulting in fast estimation of the model coefficients 
from big data. The experimental results using a real dataset showed that DPM can effectively forecast the 
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Figure 7: Visualization of dynamic purchase probability. 


time of purchase. 

We used only six marketing variables to focus the modeling idea of DPM: three semi-targetable and three 
targetable marketing touches. In practice, combining demographic information with DPM can significantly 
improve the predictive performance (Risselada et ah, 2013). For example, marketing demographic segments 
can be used in a multi-level modeling approach; each segment would have slightly different parameters. 
Social network information (Palmer and Koenig-Lewis, 2009) can also leverage the predictive performance 
of the model. Specifically, information about customers’ important life events, such as a marriage or a new 
baby, can be easily used to extend the model as follows: 

sl+i =c + (i)x\ + a^rl +f3^ml+ 

life events 

where ej represents a vector of life events. Some life events, such as having a family, may increase the 
propensity of buying financial products {-ip > 0), whereas a better deal from an rival company would decrease 
the propensity (V' < 0). 

Although we showed the predictive performance of DPM using hold-out datasets, the true predictive 
performance of the model should also be confirmed by thorough A/B-testing or randomized controlled 
experiments. The scores from DPM and other existing models can also be combined to increase purchase 
rates. To build a personalized marketing strategy, customer feedback needs to be fully utilized. An IT 
infrastructure for daily monitoring of customers should be the first step for these types of dynamic models 
in practice. Extensions of our approaches and further discussions on implementation methods are left as 
future work. 
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